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(54) Method for modelling a probabilistic system 

(57) A Tvj'hrxl clotcrmines approximate probabili- 
ties cf suies ot syslom represented by a model. The 
mDdct includes nodes connected by links. Each node 
rcD'escnts 3ossitie stales of a corresponding part of the 
System rtnd cnch lirk represents statistical dependen- 
cies bcrween possoic st<iiles of related nodes. The 
nodes arc grouped irto nrbitrary sized clusters such that 
every nocJc ts nc uoed n al least one cluster and each 
link IS ccmpictt?d corMnod in at least one cluster. (Vles- 



sages, based on the arbitrary sized cluster, are defined. 
Each message has associated sets of source nodes and 
destination nodes, and a value and a rule depending on 
other messages and on selected links connecting the 
source nodes and destination nodes. The value of each 
message is updated until a termination condition is 
reached. When the tennination condition is reached, ap- 
proximate probabilities of the states of the system are 
detemnined from the values of the messages. 
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Description 

FIELD OF THE INVENTION 

[0001] This Invention relates generally to modeling probabilistic systems, and more particularly, to nnodeling proba- 
bilistic systems using belief propagation in a Markov network. 

BACKGROUND OF THE INVENTION 

[0002] Computer models are frequently used to study the behavior of complex probabilistic systems. When the sys- 
tems contain many inter-dependent random variables. Markov networks are often used. In a Markov network, nodes 
of the network represent the possible states of a part of the system, and links between the nodes represent statistical 
dependencies between the possible states of those nodes. 

[0003] By the Hammersly-Clifford theorem, from the study of Markov networks, the probability of any set of states 
at the nodes of the network can be written as the product of compatibility functions between clusters of nodes. 
[0004] Figure 1 shows a simple network with four nodes labeled a, b, c, and d. The links between the nodes represent 
the statistical dependencies between the possible states of the nodes. For the case of pairwise probabilistic interactions 
between nodes of the network, the overall joint probability of the system'can be expressed as the product of compatibility 
functions for each linked pair of nodes; 

where t^^^ is the compatibility matrix between nodes a and b, Is a random variable describing the state at node a, 
and similarly for the other nodes and variables. 

[0005] Often. Markov networks for practical applications are very large. For example, an image acquired from a 
scene by a camera may be represented by a Markov network between all smalt neighboring patches, or oven pixels, 
of the acquired image. Similarly, the well known "travelling salesman problem" can map onto a Markov network where 
the maximum probability state corresponds to the shortest path of the salesman's route. This network has as many 
nodes as cities to be visited, in some Markov networks, the nodes can represent measured input signals, such as 
visual input data. Markov models are also extensively used in speech recognition systems. 

[0006] To analyze the probabilistic system modeled by a Markov network, one typically wants to find the marginal 
probabilities of certain network variables of Interest. (The "marginal probability" of a variable signifies the probability 
of that variable ignoring , the state of any other network variable.) For example, it may be useful to examine the probability 
of a variable that represents an underlying explanation for some measured data, such as the probability of particular 
words used to vocalize particular speech sounds. To find those probabilities, the Markov network is marginalized over 
all the other variables in the network. This gives the probability of the variable representing the explanation, given the 
measured input data values. This marglnallzation is thus a forni of inference. 

[0007] One may also want to find states of the nodes, which maximize the network probabilities. For example, for 
the Markov network corresponding to the travelling salesman problem, It is desired to find the state at each node which 
maximize the probability of the Markov network. These states, which minimize the length of the salesman's route, are 
known as the maximum a pos(er/on probability (MAP) states,. 

[0008] In the example of Figure 1 , it is possible to determine the marginal probability P(s^) of the variable at node a 
by summing the random values at nodes b, c, and d: 

[0009] In general, especially for large networks, these marginal probabilities are infeaslble to detemnine directly. The 
joint sum over all possible states of all the nodes can be of too high a dimension to sum numerically, particularly when 
the network has closed loops. 

[0010] Figures 2a-b show examples of Markov networks with many loops for which it is difficult to find either the 
marginal probability at a node, or the state of the node which maximizes the overall probability of the Markov network. 
Both networks are in the form of lattices, which are commonly used to describe the joint probabilities of variables 
spatially, distributed over two dimensions. Figure 2a shows a rectangular lattice, and Figure 2b shows a triangular 
lattice. These type of lattice networks are used to model many systems. 
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[0011] Techniques to approximate the marginal probabilities for such structures are known, but these techniques are 
typically very slow. Simulated annealing can be used, or Gibbs sampling, see Geman et al. "Stochastic relaxation. 
Gibbs distribution, and the Bayesian restoration of images," IEEE Pattern Analysis and Machine Intelligence, 6: 
721 -741 , 1 984. Another class of approximation techniques are variational methods, see Jordan, "Learning in graphical 
models." MIT Press, 1 998. However these methods require an appropriate class of variational approximation functions 
for a particular problem. It is not obvious which functions, out of all possible ones, to use for the approximations. 
[001 2] For the special case of Markov networks that fomn chains or trees, there is a local message-passing method 
that calculates the marginal probabilities at each node, see Pearl. "Probabilistic reasoning in intelligent systems: net- 
works of plausible inference," Morgan Kaufmann, 1988, The later method is now in widespread use, and is equivalent 
to the "forward-backward" and Viterbi methods for solving one dimensional Hidden Markov Models (HMM), and to 
Kalman filters and their generalization to trees, see Luettgen et al. in "Efficient multiscale regularization with applications 
to the computation of optical flow," IEEE Trans. Image Processing, 3(1);41-64, 1994. This message-passing method 
gives the exact marginal probabilities for any Markov network that does not have loops. This is referred to as the 
"standard' belief propagation, or message-passing method below. 

[001 3] Unfortunately, many Markov networks of practical interest do contain loops. For example, an image, modeled 
as a Markov network of local image patches connected to their nearest neighbors, gives a lattice structured Markov 
network as shown in Figures 2a-b, also called a Markov random field. This type of network contains nriany loops. 
[0014] Another method for inference in Markov networks applies the local message-passing rules derived for trees 
and chains in a network, even though the network may contain loops, see Weiss, "Belief propagation and revision in 
networks with loops," Technical Report 1616, MIT Al Lab, , 1997. This is referred to as the 7oopy" belief propagation 
method in the description below, although it should be clearly understood that the "loopy" method is nothing more than 
the "standard" belief propagation method applied to a network with loops. When such a procedure converges, it can 
yield an approximate detennination of the marginal probabilities. However, the loopy method sometimes gives too poor 
an approximation to the marginal probabilities, and often does not even converge. In the latter case, the approximation 
gives no single answer for the desired marginal probabilities. 

[0015] Therefore, it is desired, to provide a method for determining marginal probabilities in Markov networks that is 
both relatively fast and more accurate than the loopy method. Furthermore, it is desired to provide a method for networks 
with loops that converges more reliably than the prior art loopy belief propagation method. 

Summary of the Invention 

[0016] The present invention provides a method for determining the probabilities of nodes in a network model of a 
complex probabilistic system. More particularly, the nnethod determines desired marginal or maximum a posteriori 
probabilities in networks with loops. The method uses a message-passing scheme, appropriate for networks with loops, 
which is more accurate and typically converges more reliably and in fewer iterations than prior art loopy belief propa- 
gation methods. 

[0017] The invention describes a class of procedures in which computational cost and accuracy can be traded off 
against each other, allowing a user of the invention to select more computation for more accuracy in a particular ap- 
plication of the invention. 

[001 8] The invention has two major advantages over the prior art loopy belief propagation method. First,- the invented 
method nomnally gives much more accurate answers for the desired marginal probabilities. Second, the invented meth- 
od can converge to a single answer in cases where the loopy method does not. 

[001 9] Instead of finding the marginal probability at a node, one embodiment of the Invention finds the states at each 
node which approximately maximize the probability of the entire network. Thus, the invention provides a novel way to 
approximate both the marginal probabilities and MAP states in Markov networks. 

[0020] Many Markov network problems of interest are known to be NP-hard problems. A problem is NP-hard when 
it is inlrinstcally harder than those problems that can be solved by a Turing machine in nondelerministic polynomial 
time. When a decision version of a combinatorial optimization problem belongs to the class of NP-complete problems, 
which includes the traveling salesman problem described above, an optimization version is NP-hard. The invented 
method yields fast, approximate solutions for some of these very difficult optimization problems, 
[0021 ] More particularly, the invention provides a method for detennining marginal probabilities of states of a system 
represented by a model. The model includes nodes connected by links. Each node represents possible states of a 
corresponding part of the system, and each link represents statistical dependencies between possible states of related 
nodes. 

[0022] The nodes are grouped into arbitrary sized clusters such that every node is included in at least one cluster 
and each link is completed contained in at least one cluster. Based on the arbitrarv sized cluster, messages are defined. 
Each message has associated sets of source nodes and destination nodes, a value for^each state of the destination 
nodes and a rule that depends on other messages and on selected links connecting the source nodes and destination 
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nodes. The values of each message are updated using the associated rule until a termination condition is reached, at 
which point the probabilities of the states of the system are determined from the values of the messages. 

Brief Description of the Drawings 

5 

[0023] 

Figure 1 is graph of a network with four nodes and links; 
'0 Figure 2a is a graph of a square network; 

Figure 2b is a graph of a triangular network; 

Figure 3 is a flow diagram of a method for propagating beliefs in a network according to a preferred embodiment 
'5 of the invention; 

Figures 4a-c are graphs of nodes grouped into arbitrary sized intersecting clusters; . 

Figure 5 is a graph of a nine node network grouped into four intersecting clusters; 

20 

Figure 6 is a flow diagram of detailed steps of the method according to the invention; 
Figure 7 is a graph of a hierarchy of regions; 
25 Figures 8a-c are graphs of belief propagation for a rectangular network; 

Figure 9 is a graph of an identity for the belief propagation of Figures 8a-c; 
Figure 1 0 is a graph of another identity of the belief propagation of Figure 7; 

30 

Figures 11a-c are graphs of belief propagation for a triangular network; 
Figure 12 is a graph of an identity for the belief propagation of Figures 11a-c; 
35 Figure 1 3 is a graph of another identity for the belief propagation of Figures 11 a-c; 

Figure 14 is graph of a nine node network grouped into four clusters; 
Figure 15a is a graph of the network of Figure 14 with the nodes grouped into four clusters; 

40 

Figure 15b is a graph of a super-node network showing standard message passing; 
Figure 15c is a graph showing standard and normalized message passing; 
Figure 15d is a graph of a network with nine super-nodes; and 

Figure 16 is a flow diagram of an alternative embodiment of the belief propagation method according to the inven- 
tion. 

50 Detailed Description of the Preferred Embodiments 
[0024] Introduction 

[0025] Our invention builds on the "cluster variational method" for analyzing cooperative phenomena in statistical 
physics, see for example, the special issue in honor of R. Kikuchi, Progress in Theoretical Physics Supplements, vol. 
55 115,1994. 

[0026] A physical system is often modeled in statistical mechanics as a Markov network, where the states of the 
nodes in the network correspond to the physical states of the particles or spins in the physical system. The cluster 
variational method derives approximations to the "Gibbs free energy" of a physical system. We refer to these as "Kikuchi 
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approximations" to the Gibbs free energy, or the "Kikuchi free energy." From a Kikuchi approximation to the Gibbs free 
energy, one can detemiine properties of a physical system, such as magnetization, specific heat, or the critical tem- 
perature for a magnetic spin system. 

[0027] The simplest Kikuchi approximation derived with the cluster variational method corresponds to an approxi- 
mation first proposed by Bethe, see Bethe. Proc. Roy. See. (London) A 150 (1935) 552. The loopy belief propagation 
method, when it converges, gives marginal probabilities identical to those obtained from the Bethe approximation. 
[0028] Solutions obtained from propagation rules according to our invention are equivalent to solving the appropriate 
physical system under the corresponding Kikuchi approximation. The method of minimizing the Kikuchi free energy 
directly will therefore give identical results to our method, but such a direct minimization will be extremely slow and be 
plagued by the problem of convergence to inappropriate local minimal in the free energy surface. Thus, physics cal- 
culations made with more complicated Kikuchi approximations have typically only been tractable for homogeneous 
models of physical systems. As an advantage, our belief propagation method is also practical for heterogeneous models 
of physical systems where the nodes of the model have arbitrary topology and interdependencies. 

Calculation of marginal probabilities by Kikuchi free energy minimization 

[0029] As shown in Figure 3, we begin with a network 300 representing a system or model. Typically, the network is 
constructed ahead of time by an "expert" in the art field in which the model operates. It is not our Job to construct the 
model; instead, we provide a method for using the model, even if the model is heterogeneous and includes many loops. 
For example, in a medical diagnosis system, the nodes can represent clinical findings, physiological and psychological 
conditions, laboratory results, pathologies, etc. Here, the goal may be to use the model to make the best (most probable) 
diagnosis given observations made for a particular patient. In an error-correcting coding system, the nodes represent 
source bits or parity bits, and the links represent the required relationships between the bits. Here, the goal may be to 
use the model to decode a block of bits that has been corrupted by noise. 

[0030] The model includes a plurality of nodes 301 . A random variable s at each node takes on one of a discrete set 
of states, or it may take on a continuum of states. The nodes are connected to others by links i302 to form the network 
model 300. Typically, a link connects two nodes, however higher-order links can connect more than two nodes. The 
network model 300 can represenl a magnetic spin system, a medical diagnosis system, an image-processing system, 
a speech recognition system a genetic regulatory network, a decoding system, or other complex real-world systems. 
[0031] By definition, for a Markov network with pair-wise statistical dependencies between nodes, the probability of 
a particular assignment of values s at the nodes is given by: 



where the first product runs over all the linked neighboring nodes, /" and j. 

[0032] The o compatibility matrices represent the statistical dependencies between the nodes, as represented by 
the links 302 in the network model 300. For example, in a medical diagnosis system, one node might represent a 
symptom, and the nodes that it is linked to are related symptoms or diseases. The numerical values of the compatibility 
matrices 0, represent the strength of the relation between specific nodes. For example, in a medical diagnosis model, 
some symptoms might be strongly con-elated with certain diseases, i.e. a strong statistical dependency, while other 
symptoms might only have weak statistical correlation with other symptoms or diseases. 

[0033] For some systems, it may be desired to represent the statistical dependencies between the states of the 
nodes by compatibility functions ("links") that are higher-order tensors. For example, if the states of nodes /, /, and k 
are mutually slalislically dependent, but those dependencies cannot be expressed in terms of pair-wise interactions, 
then we introduce the tensor ti)y^^Cs,,Sy,s^jto represent those dependencies. The over-all probability of the system would 
then also include the product over all higher-order tensors as well. The method we describe here will also apply to 
systems with such higher-order links between nodes. All that is necessary is to include terms that arise from the higher- 
order links in the "energy" and "region energies" that we describe below. 

[0034] The V functions represent the "evidence" that each node is in any particular one of its states before wc consider 
any of the other nodes. While the (|» functions will normally never change, the y functions will typically change from one 
use of the model to another. For example, in the case of a medical diagnosis model, the \|;/s,) function represents, for 
a specific patient, the results of a given test. To use the same model for a different patient, we use different evidence, 
represented through the \\i functions, but we do not change the (p functions, which represent the relationships between 
different symptoms, diseases, tests, etc. 

[0035] In equation [3]. Zis a normalization constant that insures that the sum of the probabilities of all possible states 
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of the overall system is equal to one. 

[0036] In order to describe the "Kikuchi free energy" of the model, it is helpful to define some additional functions, 
which are motivated by common conventions from statistical physics. We define the "bond strength" Jj/Si^Sj) between 
two nodes / and j by: 

<^ij{s,Sj)=e ^ ' , [4] 
where 7 is a parameter called the "temperature." We define the "field" hi(Sj) at the node / by: 



y;<s,.)=e [5] 
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[0037] In tenns of these variables, we have: 
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P(Si.S2 s^)=le ^ ' 2 , [6] 

where E is the "energy": 

E{s, , ,.. s^)^Y. h ' ) + S ) • [7] 

[0038] In statistical physics, equation [6], which is equivalent to the original definition [3] of the Markov model, Is 
known as "Bollzmann's Law," and the normalization constant Zis called the "partition function." The partition funclion 
is defined explicitly by 

^-2^^ . [8] 

Following the conventions of statistical physics, we define a quantity called the "Gibbs free energy." The Gibbs free 
energy G is a function of the probability distribution P(SpS2,....Sf^, By design, G is defined so that if it is minimized as 
a function P, then Bolt7mann's Law is automatically obeyed. G is defined by: 

GiPis, , s^\r)^u-Ts-i-rii-^ X ^(^i - — )1 • [9] 

[0039] Here. U \s the expected value of the energy, also called the "internal energy": 
and S is the entropy: 

»y = - Z ^(^P^2--yw)W^p5„..^^)). [11] 

[0040] The last term in equation [9] is a Lagrange multiplier term designed to enforce the constraint that the sum of 
the probabilities over all states is equal to one. In order to insure that this constraint is satisfied, we set the partial 
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derivative of G with respect to the Lagrange multiplier y equal to ?ero. 

[0041] If we differentiate the Gibbs free energy G with respect to the probability distribution P and the Lagrange 
multiplier and set those partial derivatives equal to zero, then we recover Boltzmann's Law. Thus, to understand 
physical systenns. physicists often try to calculate and differentiate the Gibbs free energy. 

[0042] For a large system, it is intractable to calculate Gexactly, so Bathe, Kikuchi, and many others have developed 
approximation schemes. In these schemes, one determines exactly the Gibbs free energy over some small clusters 
of nodes of the network, and then approximates the total Gibbs free energy as a sum over many small clusters of 
nodes, correcting for over-counting in regions where the different clusters overlap. In the Bethe approximation, the 
clusters are precisely those pairs of nodes that are linked in the Markov network, while Kikuchi's more accurate ap- 
proximations, also called the "cluster variational method," allow for clusters of a larger size. 

[0043] The cluster variational method errs by missing contributions to the Gibbs free energy that arise from high- 
order correlations between nodes that cannot be described in terms of the chosen clusters or intersections of clusters. 
The larger the size of the small clusters, the smaller the error. 

[0044] In principle, the cluster variational method allows for arbitrary clusters on arbitrary Markov networks. In prac- 
tice, however, most previous computations involving minimization of a Kikuchi free energy were restricted to symmet- 
rically placed clusters over a homogenous network. Computations on inhomogeneous networks have been performed, 
but only for the Bethe approximation, see Morita, Physica 98A (1979) 566. 

[0045] By "homogenous" networks, we mean networks for which the nodes are arranged in some regular repeated 
pattern, and the inleraclions between the nodes depends only on their position in that patlem. Many magnetic spin 
systems are "homogenous" in this sense, but most of the other types of systems that are modeled with Markov networks, 
such as the example of a medical diagnosis system, are not homogenous. 

[0046] Previous prior art Kikuchi free energy minimization calculations were restricted to symmetricatty placed clus- 
ters over a homogeneous network because, before our invention, there was no practical way to quickly minimize a 
Kikuchi free energy defined for arbitrary clusters over an inhomogeneous Markov network of arbitrary topology. 
[0047] Using the Kikuchi approximate fomn for the total Gibbs free energy, we can approximately calculate the mar- 
ginal probabilities of the variables at the nodes, and also the marginal probabilities of collections of nodes. We extend 
this technique to develop a message-passing method that converges to solutions corresponding to a Kikuchi approx- 
imation of the Gibbs free energy. A message-passage method, as an advantage over the prior art Kikuchi free energy 
calculations, is very much faster and more practical for inhomogeneous networks. 

Kikuchi free energy calculation for simple example 

[0048] We use Figures 4a-c to illustrate our generalized message-based belief propagation method. This example 
network shows how we develop a message-passing method that has a corresponding Kikuchi approximation result as 
a fixed point of the messages. That is to say, our message-passing method will converge to a solution that is equivalent 
to the solution of the Kikuchi approximation. 

[0049] Figure 4a is a graph of many nodes grouped into three clusters. The clusters intersect each other. The nodes 
and their links are not shown, and indeed, the number of nodes per cluster can be quite large. In real world physical 
systems, one usually chooses many more than three clusters for a Kikuchi approximation. Figures 4b-c depict two 
possible instantiations of the three clusters and theirJntersections, 

[0050] As described above with respect to Figure 3, we group 310 the nodes of the network 300 into arbitrary inter- 
secting clusters 31 1 . A cluster is said to be intersecting if a single node appears in more than one cluster. Because we 
do an arbitrary clustering, we can focus on regions of the model that might be more significant to an exact solution. 
We require that every node is included in at least cluster. We also require that every link in the Markov network is 
completed included in at least one cluster. The first constraint ensures that our clusters represent and consider all 
nodes of the model, and the second constraint ensures that the clusters will intersect. 

[0051] Figure 4a shows abstractly three clusters of nodes (regions /, 2, and 3), and their intersections (regions 12, 
31, and 23), and intersections of intersections (region 123). 

[0052] We use the following temninology. The largest regions are referred to as "clusters." The clusters, along with 
their intersections, and the intersections of those intersections, and so on, are referred to collectively as "regions." A 
"sub-region" of region ris a region that consists of a proper subset of the nodes in region r. Region ris a "super-region" 
of region s if and only if region s is a sub-region of region r. 

[0053] The number of nodes in each cluster can vary, and be different from each other. As stated before larger 
clusters will yield better results than small clusters albeit at a greater computational cost. Note that region 12 is inter- 
preted to include region 123, just as region 1 also includes region 123, 

[0054] I n equation [1 2] below, we approximate the total Gibbs free energy for the system 300 by the sum of the free 
energies of each of the clusters. Then, we include correction ternis by subtracting the free energy of intersection regions 
where we over-counted the free energy We may include additional correction terms for the over-counted intersections 
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of the intersection regions, etc. In addition, we impose, with Lagrange multipliers, the additional constraints that the 
probabilities of each region sum to one, and that the probabilities have the correct marginallzation relationships with 
respect to each other. 

[0055] We express the Gibbs free energy in terms of the probabilities of each of the regions. We let a^be a particular 
assignment of node values within the cluster region /. This is what we called Sp s^, .... Syy, in equation [3] above. Our 
notation for region intersections is to concatenate the indices of the intersecting regions, so a^yi^ is a particular assign- 
ment of node values within the region of intersection of regions /, / and k. 

[0056] The "region energy" E(a/) includes those tenns in the overall energy thai can be attributed to nodes or links 
contained entirely in the region /, when the nodes In the region are set to the particular values denoted by ay. Note that 
if there are higher-order tensor "links" in the definition of the system, the region energy will still contain those links that 
are contained 

[0057] The "region probability" P(a,) is the marginal probability that the nodes in cluster /take on the particular values 
denoted by «/. In our notation, the normalization constraint for region / is Ij^^Pia;) = 1 . That is, the sum over the prob- 
abilities of all possible value assignments to the nodes of region / is one. An analogous equation holds for all regions. 
These are the Lagrange multiplier constraints y in equation [12J, below. 

[0058] The second set of constraints for the Gibbs free energy in equation [12] are the marginallzation constraints, 
associated with Lagrange multipliers X. The Lagrange multipliers impose consistency constraints between the proba- 
bilities of the regions and their intersecting sub-regions. The probability of a region, marginalized over all the nodes in 
the region not in some intersection region, must equal the marginal probability of the nodes in that intersection region. 
In our notation. Z^^^^ = P(u/^). where Z^^.^^ means the sum overall nodes in region / that are not in region / 
[0059] For the three clusters of nodes shown in Figures 4b-c, we express the Gibbs free energy in the Kikuchi ap- 
proximation as G ^2,3 We^have three sets of U-S Tterms, first a sum over states a, for the cluster region free energies, 
second the terms involving a,y where we subtract over-counted intersection regions, and a third term involving a ^23 
where wo correct for over-subtracting that intersection region. After all these corrections are made, each subset of 
each region is counted exactly once^ as it should be. 



G\,2j - {U — S T) terms + y constraints + X constraints 
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55 [0060] One point aboji ihc mnrginali^ation constraints. One might wonder why we do not also have terms like 
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which enforce that region 1 properly marginalizes down to region 123. We onriit these terms because they are redundant 
with the marginalization equations already included in 2 3- 

[0061] Marginalization is a linear operator. Of the three possible constraints relating regions /, ij, and ijk, that region 
/marginalize down to region ij, that /; marginalize down to ij'k, and that region /marginalize down to ijk, we only need 
two, and the third is linearly dependent on the other two equations. 

[0062] The Kikuchi approximation to the Gibbs free energy can be generalized to the case of an arbitrary collection 
of regions in a relatively straightforward way, but we need to establish some additional more notation. Let r denote a 
region chosen from the set of all relevant regions R. Let S(r) denote the set of sub-regions of rand T(r) denote the set 
of super-regions of r Define a "direct sub-region" of r as a sub-region that does not have a super-region that is also a 
sub-region of r. Thus, in our three cluster example, regions r^and 13 are direct sub-regions of region 1, but region 
123 is not. Let Sj(r) denote the set of direct sub-regions of r and TJr) denote the set of direct super-regions of r. 
Associate with each region ra "over-counting number" defined recursively by 



d,=l-^d,. [13] 

20 t^Ur) 

[0063] The largest regions with no super-regions are defined to have a double-co^jnting number of one. In the three- 
cluster case, we have d^-dg^dj^l, d^2^d23=d2^^'1, and d^23='^' over-counting number is needed to insure that 
each link is ultimately counted exactly once. 
25 [0064] The generalization of equation [12] for the Kikuchi approximation to the Gibbs free energy for the case of an 
arbitrary collection of regions is: 



[14] 



where and are the internal energy and entropy of the region r. 

[0065] We want to find the cluster region probabilities that minimize the Kikuchi free energy while obeying the mar- 
ginalization and nonnalization constraints. These probabilities are the correct marginal probabilities under the Kikuchi 
approximation, 

[0066] To find equations for those probabilities, we differentiate equation [14] with respect to each of the region 
probabilities for each possible assignment of values to a region and set the result to zero. The resulting equations, 
^0 illustrated for the three-cluster example of Figure 4a, are: 

£(ai)+r+rin P(ai)4-Yi=>.ni2(ai2)+>^iV3i(«3i) V^] 

E(a2KT4-r In P{a.^)^y2 = A.2,i2(ai2)+>.2\23(«23) [16] 

E{a^)+T4.T\n P{a:^)-i-y3 = X:^s23i^23)+h^3^(^3^) [17] 

-E(ai2)-7--rin P(ai2)+Yi2=-^lM2{ai2)-X2,i2(«i2H>^12\123{«123) [18] 
'E(a22)'T~T\n P(a23)+723=-^S23(«23)->-3\23(«23)+^23\123(«123) [19) 



55 



10 



EP 1 160 678 A2 



- £(0.3, )- r- 7 In P(a3i )+y3i =-X,^i (aj, H^^^ (ag, hX^^^^s^ia^zz) 



[20) 



E(ai23)+r+rin P{ai23)+ri23=->-i2v123(«123)-^23\123(«123)-^3ni23(«123)- 



[21] 



[0067] Exponentiating both sides and rearranging ternis of each equation, we have the following set of equations 
for the marginal posterior probabilities: 



P(Q!;) = fc,exp 



exp 



[22] 



?(£rj) = A:2exp 



exp 



[23] 



P(a^) = ^:3exp 



[24] 



^(«i2)=^i2exp 



exp 



'A/n(Qi2) +^^2(0^2) ->^am(Qi23) 



[25] 



^(fl£u) = fcuexp 



^(53) ^/23(Q73)+-^(Q^)-^m(Q^23) ' 



r 



exp 



[26] 



?(a„)=^3,expf-:^lexP 



f ^/3|(Qb|)-'-Av3l(Q^l)-^lM23(Q^23) 



[27] 



[28] 



where the k's are normalization constants, different for each equation, that normalize each marginal probability. Each 
of these equations is actually a collection of equations. Thus, if we focus on equation [22], we get one equation for 
each instantiation of a^. On the right hand side of this equation, a^^and aj, are completely determined by a,, because 
regions 72 and 31 arc sub-regions of region 1. 
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[0068] 0( course, we also need to differentiate the Gibb's free energy with respect to the X and y Lagrange multipliers. 
Doing that gives us the desired marginalization and normalization conditions for the region probabilities. 

From Lagrange multipliers to messages 

[0069) In ihc previous section, we derived a set of self.consistent equations for region probabilities and Lagrange 
mjliiphcrs by differentiating the Kikuchi approximation of the Gibbs free energy. These equations are a solution to the 
problem of finding marginal posterior probabilities at the level of the Kikuchi approximation. 

[0070J In ou' new method, we develop equivalent and more convenient formulations of the same solution, in terms 
of H convcrgcrt 'ncs sagc-passing" method. By defining 320 the Lagrange multipliers, see step 320, in terms of another 
set 0' qunniiiics called "messages," we obtain a different but equivalent set of equations for the cluster probabilities. 
[0071] Our method works by treating the self-consistent equations as message update rules 321 that are iterated 
until a icrTTunaiion conaition is reached, for example, a fixed number of iterations, or the solution converges 330, at 
which pomi trie marginal posterior probabilities 332 can be determined from the converged messages 340. Before the 
Iirsi update wr assign niiial values 331 to the messages 321 . The initial values can be random or special values, e. 
g. ali random positive numbers, or all ones. 

[O072] In general :nc'c arc many equivalent sets of message definitions and message-update rules that correspond 
10 Ihc same Kikucn. r.ppo i ^nation Some particular message definitions and message-update rules, which we describe 
iHlcr. have paMiCjMftv '^•tv p*ou'cri es and reduce to the already known ordinary message-passing methods when 
applied to the Ek-iMc rn^'u»*i aitun lit this section, we want to describe the general properties that should be satisfied 
by messages so p^fticuiar message-passing method is equivalent to the corresponding Kikuchi approximation. 

[O073] Wc hHvc uica t^c m^rqmaii/aticn relations to relate region probabilities to probabilities of their intersection 
regions For the :h'ec ciustc cuampto equations [22]-[28] describe how the probabilities are related to the region 
energies aid the ncym.,(./.iion .md m.uginnlization Lagrange multipliers. We ignore the normalization Lagrange mul- 
ttpltofs (the y's) iof no* nq that ihoso Simply give a multiplicative scalar which normalizes the final probabilities. 
[0074] Ourgoa i* tc cn^rcit mc&sagcs in terms of the X Lagrange multipliers. It will turn out that the marginalization 
ccnslraini rotations wh; t^cn rjfvc us a sot of fixed point equations for the messages, which we will interpret as the 
message und^tp fii»o<. .17 ^ 

[0075] We s-ac* hI of tno y 1 ^r/ango multipliers to create a vector X. In this vector, there is one X Lagrange multiplier 
fo' each Imoariy nacponocnt mrtrgnali/alion constraint, times the number of different possible variable assignments 
in the cluster region tcirq margmati/od down. For our three cluster example, the vector X includes nine sets of Lagrange 
mjinphcrs >.,..,vu ,;J A, j./iij,,). 12 (a^o)^ >-Wa23^. '^2/23(0-23^^ ^3/31(0.31). h 2/1 23(0^1 23^. ^23/23(^123)^ and X^^/^os 
which wc laKctc DC si.tckcdm that order. The total dimensionality of the vector X is m2 0^2 + 2 Dss-i- 2 D^^ 
*3D,-,Twhorc *or Olympic D,-> is the dimensionality of a,^. 

[O076] In our example cqoaiions |22-28] give the cluster probabilities in terms of the Lagrange multipliers X. We 
lake the logarithm of tncsc equations, and define the local quantities 



L(a) = E(a) + T InP(a) [29] 
for each cluster 'hen summanze these equations by: 



t = AX, [30] 
where t is a vccicr oi ir^c /. s m ihc seven regions taken in the order 1, 2, 3, 12, 23, 31, and 123, and I has dimensionality: 

[0077] .4 is a mairw wan rows and D)^ columns, but it has a special block form, which we express as: 
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[31] 



15 



20 



[0078] Each entry in this matrix actually denotes a sub-matrix. For example, the i in the upper left corner denotes a 
sub-matrix o( D, rows and D^o columns which is equal to one when the nodes in a^^ agree with the corresponding 
nodes in (/,. and is equal to zero otherwise. This becomes clear in a more specific example. 
[0079] ir region 1 includes two nodes Sg and S^, and region 2 includes the single node S^, and each node can be 
in iwo siriics lhal we label "1" and "2," then a, can be in any of the four stales 3^-1, S^=1: S^=1, 3^=2; 3^=2, S^=/; 
^ 2. St, 2cina u 12 can be any of the two states 5^,= 1; 3^=2, and the sub-matrixi in the top left corner of A is expressed 

as 



25 



30 



^\ 0^ 

0 1 

1 0 

0 1/ 



35 



40 



45 



[0080] In general :hc sctf-consistent equations derived from a Kikuchi approximation can be summarized in the form 
of mntf iM cquHhon [ - Ak and a set of nonmalization and marginalizatlon conditions. 

[0081] Wc ccsifc an alternative formulation for our solution In terms of "messages" and self-consistent message 
ccuHiions Wc still use the same normalization and marginalizatlon conditions on the cluster probabilities, but we 
rc:lofinc |f^c L^qr-inqc multipliers X in terms of messages. 

[0082] Wc store (he icgrththms of all the messages between the regions in an ordered vector m. To be commensurate 
wfih mc iHqfMnqc^mutiipiicrs that we seek t^summarize, we normally let the dimensionality of m be the same as the 
dimensionality of a In general, the vector m can have different dimensionality than X, but the vector needs to span 
ijio same sub spricc of constraints on the lattice so that we can re-write the Lagrange multiplier 5[ in tenns of messages 
m 

[0083] Thus for Out example, we generate sets of logarithms of messages commensurate with Z as m^^^2(^^2)' 

Jif'jf' ^^^33(^^23)' ^3\23(^2^h.^3\3li^3l)> ^12\123(^123)' ^23X123(^123)' ^nd m3^^i23(^123)- The vector 

m siofcs mcsc messages in that order. 

[0084] There is a sot of self-consistent equations for the cluster probabilities in terms of the messages where the 
ccualions have the mainx form: 



50 



L^Bm. 



[32] 



or oquivalcntly. 



55 



/A A. = 8 m, 



[33] 



where S is a "sclcctton" matrix that determines the choice of the message-passing method. This matrix is specified 
below. 
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[0085] The matrix B and the vector m constitute a valid reformulation of the Kikuchi solution as long as the system 
described by equation [33] has at least one solution that can be written in the form of a matrix equation: 



m = CX. 



[34] 



[0086] Any valid reformulation of the Kikuchi solution can be used as the basis for our message-passing method. 
[0087] To_iIlustrate this idea, we return to our three-cluster example of Figure 4a, for which we have already chosen 
the vector m . We can now choose many different matrices S, but we use one that arises from a scheme that we 
describe in more detail below. 

[0088] Recall that the m's defined above are the logarithms of messages, which we denote by Ms. We now set the 
marginal posterior probabilities of the regions to obey the set of equations: 



75 



J p 3\31 \"3\ 



[35] 



20 



[36] 



25 



P(.aj) = kj exp - 



^ KS\ )M 1\2J i^H )^ 23U23 (^123 ) 



[37] 



30 



35 



^(^12 ) = *11 eXp^- ^^Y^ ^"^^ ^^'"2 (^12 )^23V.23 (0^123 Wj.xm (^.23 ) [3 8] 



40 



Pia2,) = k:a "P 



-^^^jw3V23(«23W3V23(a23)J^,2V,23(a,^W3nm(a,23) [39] 



45 



50 



f EM 



23M23 



[41] 



55 



where again the ks are normalization constants that vary for each equation, as for equations [22-28] above. 
[0089] Taking the logarithm of these equations, and recalling the order of the rris given above, we see that equations 
[35-41] are equivalent to the choice 
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[42] 



'5 [0090] The choice of Sand m give an equivalent refornnulation of the region probabilities when there is a solution to 
the equation: 



20 
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35 



55 



m^CX, [43] 

where 



C = B^A. 



[44) 



and is the pseudo-inverse of 6. Of course, we can always compute a C matrix as above, but for the matrix to 
represent a true solution, we must verify that matrix equation 



30 BC^A 



[45] 



is satisfied. 

[0091] If a solution XoA% = Bm exists for a particular choice of matrix 8 and message vectors m , we discard the 
original Lagrange multipliers X, and work entirely with the messages. In this case, we guarantee that the self-consistent 
equations for the messages will ultimately give the same cluster probabilities as those given by the Kikuchi approxi- 
mation. 

[0092] The next step is to determine the self-consistent equations for the messages. To do this, we need to combine 
the equations for the cluster probabilities in terms of messages with the marginalization conditions on the cluster prob- 
abilities. In terms of our example, we use, for instance, the equation P(ai2) = lams ^(oci) along with the equations for 
P(ai) and P(ai2) terms of messages to obtain 

where k is again a normalization constant. In all, there will be one such equation for each marginalization constraint. 
[0093] We use the Dj^ equations of the fomn of equation [46] in our message-passing method, by interpreting them 
as message-update rules. In this interpretation, any messages on the right-hand-side of an equation are treated as 
the messages at iteration t, while the messages on the left-hand-side of the equation are the desired messages at 
iteration t+1. We start, for example, with random messages 331 , and at each iteration replace the "old" messages on 
the right hand side of the equation with "new" messages from the left-hand side. In practice, the messages usually 
converge to a fixed point. When they converge, the fixed point corresponds lo the solution in which we are. interested. 
[0094] As for all methods of this type, we can update the messages synchronously or asynchronously, and we can 
update the messages with some linear combination of the "old" and "new" messages. Our method is an iterative method 
for the solution of nonlinear equations, and as such, its convergence properties may be improved by using standard 
prior-art techniques for such methods, see "Iterative Methods for Linear and Nonlinear Equations", C. T. Kelley, 1 995. 
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[0095] Regarding the normalization constants k, we could continue to set these normalization constants by requiring 
that the total cluster probabilities sum to one. However, another procedure that works equally well, and is easy to 
implement, temporarily scales the messages by requiring that the sum of all the components of a message to equal a 
given arbitrary constant while iterating the self-consistent equations. Reasonable choices for the constant are one. or 
the dimensionality of the message. Thus, for instance, we can require that: 



[0096] After iterating the message equations to convergence, we re-scale the messages to guarantee that the region 
probabilities are properly normalized. 

[0097] We have described the messages as being associated with a region rand one of its direct sub-regions s. We 
can cquivalently describe the message as going from a set of "source" nodes to a set of "destination" nodes. The 
"source" nodes are those nodes in region rthat are not in direct sub-region s, while the "destination" nodes are those 
nodes in s. 

[0098] To obtain probability functions defined on a subset of the nodes in a region, one simply marginalizes the region 
probability by summing it over those nodes that are in the region but not in the subset of interest. 
[0099] Wo can also determine the maximum a posteriori probability values of the nodes in a region rather than the 
region probabilities. For the prior-art "standard" belief propagation method, a simple modification suffices to obtain the 
approximate MAP values of the nodes. The necessary modification is to replace all sums over node values in the 
message update rules with the value of the maximum, i.e., the "argmax," over the same node values. The same mod- 
iftcaion can also be used to obtain the MAP values of regions for our method. Alternatively, wc can use the physics 
foTTiulation of the Markov network, and set the temperature 7to a very small positive constant. In the limit T-> 0, we 
rocovof region probabilities that are entirely dominated by the MAP values of the nodes. 

[0100] The mossago-update equations for some arbitrary B matrix is not, in general, convenient to solve. With an 
amiifHry Ft matrix one might find that a message between two nodes on one side of the network would depend on 
mf^ssHfjos between other nodes that are very far away. This is an undesirable property, both because it seems unin- 
!u tive rtHd because it means that solving the message-update equations will involve matrix multiplications with very 
large matrices Finding the free energy for arbitrary S matrices also involve large matrix multiplications or inversions. 
(0101 J In order thai (icraiing the message update equations be an efficient operation, we desire that the matrix B is 
sparse nnd local i c only a few non-zero elements for each row or column, and only nearby messages influence each 
oner 

(0102) In the next section, we describe a "canonical" message-passing method that yields these desired properties. 
A Canonical method 

[0103] In trie previous section, we described the properties that must be satisfied by messages and their associated 
mcssaqo- update rules whch we call the "message-passing method," In order that the messages give results equivalent 
10 mc KKuchi approximation. In general, there may be many different message-passing methods that are all equivalent 
to me same Kikuchi approximation. In this section, we describe how to select one particular message-passing method 
thai wc call the "canDncal' method. 

[0104] Our canonical method gives the same message-update rules as the standard loopy message-passing method 
whcr applied at the level of the Bethe approximation. More generally, and as an advantage, our canonical method 
loads to /or^/ message-update njles that are relatively easy and convenient to solve because they correspond to a 
sparse 6 rualf x 

[0105] We start wiih a Markov network with local compatibility functions ((>,y and evidence functions y^. The overall 
probability of a configuration of states s^, s^, ... is expressed as: 




[47] 




[48] 



where the first product runs over all linked neighbors, / and / 

[0106] Using Figure 5, we illustrate this procedure with a specific example. In this example, nine nodes labeled a 
through /, are arranged in a 3x3 lattice 500. The only statistical dependencies or links ((>/,<Sy.Sy) included are the ones 
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connecting nearest neighbors in the lattice. 
General procedure 

[0107] The detailed steps of a general nnethod is described with respect to Figure 6. In step 601 , a network 600 of 
nodes and links is grouped 601 into clusters or "super-nodes" 650 to be used in the approximation. These clusters can 
have a sinnple geometric shape, repeated over the lattice. Alternatively, the clusters can be triangles or squares of an 
irregular network, or the clusters can be some other arbitrary choice of node clusters best suited for the approximation. 
The set of clusters must cover all the links in the Markov network, and the clusters must be small enough so that we 
are able to perf onn the generalized belief propagation operations described below. 

[0108] In our example, we group the network 500 into clusters: (abde), (beef), (degh), and (efhi) 501-504 of Figure 
5. This set of clusters "covers" the lattice 500. That is, every link between nodes which influence each other is completely 
included in at least one of the four clusters we have selected. Of course, we can select a different collection of clusters 
which completely covers the lattice. That would just be a different Kikuchi approximation. 

[0109] As stated before, the size of the clusters is proportional to the accuracy of the approximation. If we chose a 
cluster (abcdefghi), i.e., all the nodes in the system, then our approximation is exact. While that grouping may be 
feasible with such a small example, in general, we will perform summations whose number grows exponentially with 
the size of the cluster. Therefore, in a practical application, the number of nodes in a cluster is limited. 
[01 10] The prior art Bethe approximation is obtained when we select as the clusters all the pairs of nodes that influ- 
ence each other In our example, that would mean partitioning the nodes pairwise into clusters (ab), (be), (de), (ef), 
(9^). (^f), (3d), (be), (cf), (dg), (eh), and (fi). Because our four clusters are larger than the twelve clusters of the Bethe- 
approximation, we obtain more accurate results. 

[01 1 1 ] In step 602, we identify a collection of relevant regions 651 by recursively generating intersections of clusters, 
for our example, the regions, (be), (de) (fc), (he), and (c). Wc discard duplicate regions. 

[0112] In step 603, we arrange the regions into a top-to-bottom hierarchy 652 of their intersections. The hierarchy 
for our specific example is shown in more detail in Figure 7. This hierarchy has the property that each region is connected 
only to its "direct sub-regions" and "direct super-regions." A sub-region s is a "direct sub-region" of another region rif 
there are no other regions in our collection that are super-regions of s and sub-regions of r. In our example, regions 
(be) and (de) are both direct sub-regions of region (abde), but region (e) is not. Region ris a direct super-region of 
region s if and only if region s is a direct sub-region of region r. 

[0113] In step 604. we identify messages M 653 which connect all regions with their direct sub-regions. In our ex- 
ample, we would identify messages M^^^,^^, A^aw.we. f^bcente^ M^efMa^ Mj^^,,^, M^.^^^,, M,f^,,,„ M,f^^,. M^^^, 

^de\e' Hole. M^^^^. 

[0114] We can identify these messages using a more compact notation that we illustrate with one example: 

^^^^abde\be- [47] 

[0115] For this message, we send a message from the upper nodes a and dXo the lower nodes b and e. We can 
consider the upper nodes to be the "source" nodes for the message, and the lower nodes to be the "destination" nodes. 
Each of the messages between a region and its direct sub-region has dimensionality corresponding to the dimension- 
ality of the sub-region. Thus, /Wg" has the same dimensionality as region (be). When we want to emphasize that fact, 
we express the message as M^(s^, s^). 

[0116] In step 605, we construct a probability function P(af^ and a local potential Ofa^; for each region R. We 
construct a vector t out of the quantities L(a)s In P(a) - tn a>(a). For our example, we need to construct probabilities 
P{Sa»H.s^.s^), P(s^, s^s^,Sf), P(s^,s^,Sg,Sf,), P(s^,Sf,s,,,s,), P(St^,sJ, P(s^,sJ, P(Sf,s^), P(s^,s^). and P(s^), The local 
potentials a>(afl) are constructed by combining all the original potentials cpy and \|/, that belong to the region R. Thus, 
for example, 

while <l>{Si„Sg)=<|)i^(Sb.s^)\t/5(Sb)x|/e(s^) and 0(s^)=v'a(5e)- 

[0117] Each probability function has the dimensionality of the region it describes. For example, if node bean be in 
any of three states and node e can be in any of four states, then P(a^e) is described by 3*4=12 numbers. 
[01 18] In stop 606, wc specify the matrix B 656 that determines the selection of the message-passing method. This 
matrix will operate on the vector m created from the logarithms of the messages specified in step 604. The matrix B 
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has a block form that we explained when describing equation (33). This special block form "hides" the details of the 
different possible values of the nodes within each message or region. Using this block fomn, the matrix B 655 has A/^ 
rows and Nf^ columns, where A/^ is the number of regions, and A/y^, is the number of messages. Each entry in the matrix 
is either aomatrix or al matrix. We determine the entries of the matrix S by the following sub-step 606 for each entry: 

(a) Label the region corresponding to the row of the entry Rf,, and the region and sub-region of the message 
corresponding to the column of the entry Rq and Sq respectively. By \ S^, we denote the set of nodes that are 
in Rq but not in Sq. 

(b) If the nodes in Sq are the same as the nodes in R^, or if the nodes are a subset of the nodes in Rpf, and if all 
the nodes in Rq \ S^are not in fl^. then the entry is a J matrix. Otherwise, the entry is a 6 matrix. 

[0119] Explained more intuitively, this rule says that every region gets messages that start at source nodes that are 
outside of the region and end at destination nodes that are inside of the region. In our example, assuming the regions 
are listed in the order (abde), (beef), (degh), (efhi), (be), (de), (fe), (he), (e), and the messages are ordered as we listed 
them above, the matrix Bis:. 



00101000001 1 

ToooooTooToo 
oToooooTToTo 
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[49] 



0 0 0 1 0 0 1 0 1 1 0 1 



000001011110 

ooooooooTTTT 



[0120] We explain one of these entries: the T matrix in the first row, third column. The first row corresponds to the 
region (abde), while the third column corresponds to the message M^[. Both the nodes cand /are outside of the region 
(abde) and both the nodes b and e are inside the region (abde), so the conditions for choosing ajmatrix are met for 
this entry. 

[0121] One important advantage of this mle is that when it is applied to the messages and regions obtained from 
the Bethe approximation, the rule will always result in a matrix Sthat will ultimately yield the standard "loopy" belief 
propagation. Anothenmportant advantage is that this rule will normally give back a matrix B that results in a system 
of equations At^Bm that is solvable. Another advantage is that this rule will result in a Smatrix that is sparse and local. 
[01 22] In step 607, we exponentiate the equations represented by the matrix equation I = Sm to obtain the equations 
656 relating the region probabilities to the messages. For our example, we obtain nine equations from the nine rows 
of the matrix equation. As an example, we show how to construct the equation for the first row. The first row of the B 
matrix corresponds to the region (abde). Reading across the first row, we see that there are I sub-matrices in the third, 
fifth, eleventh, and twelfth columns, which correspond to the messages M^^ , Mi^, , and /if . Therefore the eauation 

■J be' de' e' e • ^ 

for the first row is: 



[0123] The other equations are constructed in the same way. For example, the equation for the fifth row is 



[0124] In step 608. we express the A/^ constraint equations on the region probabilities, and substitute in equations 
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relating the region probabilities to the messages in order to obtain a set of self-consistent equations (update rules) 
for messages 657. 

[0125] In our example, we have twelve constraint equations. In the canonical method, there are always exactly as 
many constraint equations as there are messages. One of the constraint equations in our example relates P($a,Stj,s^ 
sJ[QP(Si„sJ: 



P(Sy.S^)^J,PCs..S,.S,\sJ. [52] 



[0126] Using the results from the previous step, we obtain the self-consistent equation for the messages from the 
marginalization (abde) (be): 



[0127] Eleven other similar equations can be obtained from the constraint equations. From the marginalization of 
region (abde) -> (de) we obtain: 



[0128] From (bce1)^(beY 



[0129] From(i3ce/)-»(fe): 



[0130] From {degh) [de): 



[01 31 ] From (degh) ^ (he): 



[0132] From (efhi)^ (fe): 

M% is^ . s^)Ml (J J = ;S ^fi ('/ )^^« » )<t>s. {Sf . (5 J^/, is, , J J [59] 
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[0133] From [ef hi) (he): 
■'('/ 

[0134] From (be) (e): 

M'M^Y.^^^'^^^^^'i'^ihWi(.s,.s,)M [61] 
[0135] fxQmide)^(e): 

[0136] From (/e) ^ (e): 

^/ (0 = X • )^/ )^ A • ^. )^/^ . (^/ is J ) [63] . 

V 

[0137] From (he) — (e)- 

[0138] In step 609 we solve the self-consistent message equations 657 by iteratively updating the values according 
to the rules, beginning with some initial message values 658 and some "evidence" v659. The initial values 658 can 
be random or special values, e.g., all random positive numbers, or all ones. The "evidence" is given by the values of 
the v variables. 

[0139] For this sub-step, there are no guarantees of convergence, though for many cases in practice, such systems 
of equations will quickly converge. One can continue updating the message-update equations until they satisfy some 
termination condition, or alternatively stop after a given number of iterations. 

[0140] For our example, in order to obtain some numerical results, we need to specify the values of the variables 
defining the model . VVe choose the case where all the nodes can be in one of two states, and where the local potentials 
(Pij( S/,Sy) can be written as: 



^exp(7,/r)exp(-/,/r)^ 
^exp(-J../r)exp(/./r)^ 



[65] 



where J,^ is a bond strength and Tis the temperature parameter. This is known as the Ising model of statistical physics. 
We set all the bond strengths between nearest neighbors = 1 and the temperature 7=2. We also set the v/s^; 
"evidence" potentials to be uniform for all nodes h 
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[66] 



[0141] This corresponds lo the Ising model in a zero field. 

(0142) Convcrgcn: dynamics are obtained for this model by taking, at each iteration, the new messages to be an 
average of inc "old" messages and the new computed messages. The messages from one node to one node are 
dcicrmincd first and then these results are used to determine the two node to two node messages, in general, one 
can order ihc Ticssagc-update equations so that the messages that depend on the fewest nodes are solved first, so 
that those results can be used on the left-hand-side of the equations for the later messages. 

[01 43) Next we scivc for the region marginal probabilities 660 using the converged values of the messages and the 
eouaitons 656 prevtousty derived. We insure that the region probabilities are nonTialized so that they sum to one. 
[0144] intercstnqv lor our example, the final messages obtained from a "run" depends on the initial messages. 
However the tm^i convcrqed region probabilities derived from the final messages always are the same. This reflects 
a general pncnomcnrt wncn cs ine fact thai when there are more messages than regions, the messages form an over- 
complete reprcscriir-.tion o' mc region probabilities. 

(0145) The nurTK.-f c^j rc>u«i cbiaincd using our four cluster Kikuchi approximation are excellent and much belter 
than ttiobc obtrtiitc-o Nlttt tn,- pf io» n(\ paiiwise Bethe approximation. 

[0146) To trtVc or^c c. ^f-Tp»c tr%c ciacity computed value of the probability function P(s^,Sq) is, to six significant digits: 



0.406496 0.093504 ^ 



0.093504 0.406496 



I- 



[66] 



[0147) Our app*nim^ron yirrids 



(0AQ6659 0.09334n 
0.093341 0.406659 j" 



[67] 



While the rcsu t fr sunOHfc tx^iicf propagation, equivalent to using the Bethe approximation, is: 



r 0.365529 0.134471 
0.134471 0.365529 



[68] 



[0148) Thus our pcncra i/cd cciicf propagation method has eliminated more than 99.6 percent of the error associ- 
ated with the prior ari •stanavd* belief propagation in this case. This is a significant improvement over the prior art. 

[0149) Calculating oirtcf nurtic'icrtl qjarUitics gives similar improvements for our example. 

Determining the KIKuchi free energy 

[0150) So far we have ocscntxsd a process for generating message-passing methods. We have not yet mentioned 
the Kikuchi free energy c.iicuUtions that motivated our work. In fact, our method as described above has practical use 
by itself. On the othc h^nd fo* scmo applications, our method may be of interest to determined the Kikuchi free energy, 
as well as other physical cuanm es. such as the internal energy, the entropy, the specific heat, magnetizations, sus- 
ceptibilities, and possible cnticai lonporatures that can be derived from the Kikuchi free energy 

[01 51 ] The Kikuchi appr oitf-iHi on lo Ihe Gibbs free energy includes the following parts: the internal energy (enthalpy) 
U, entropicterm -T S and aqrango multiplier terms constraining the nonnalization and marginalizations of the region 
probabilities. The full formula (or the Kikuchi approximation to the Gibbs free energy is given in equation [14]. 
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[0152] For the internal energy and entropy parts, one must remember to weight each region's contribution by its 
over-counting number After the message-passing method has converged, the Lagrange multiplier terms all sum to 
zero and can be ignored, because the corresponding constraints are satisfied. 

Application of the canonical generalized belief propagation method to some important Markov networks 

[0153] It may be helpful if we explicitly express the equations obtained using the described canonical method for 
some common Markov networks. We consider a square lattice with clusters consisting of all the four-node squares in 
the lattice, and a triangular lattice with clusters consisting of all the three-node triangles in the lattice. 

Square lattice 

[01 54] In a square lattice of nodes each node is connected to its four nearest neighbors. Figure 2(a) is an example 
of a square lattice. This is an important Markov network for image processing or computer vision applications. Using 
each possible four-node loop as a cluster, we derive a set of message-passing rules which improve the accuracy of 
the computed marginal probabilities over the Bethe approximation's performance. To obtain improved performance, 
we introduce double-indexed messages, i.e., arrays of numbers, in addition to the single-indexed messages, i.e., vec- 
tors that are part of the Bethe approximation message-passing. 

[0155] Following the implementation rules described above, we derive the message-passing rules, and marginal 
probabilities in terms of those messages. These, we list below. We list only the marginalization and propagation equa- 
tions for the regions away from the edges of the array of nodes. 

[01 56] The regions obtained from the intersections of the original square clusters are regions consisting of two nodes 
connected by a link, and regions consisting of a single node. 

[0157] The marginal probability of a region consisting of a single node is expressed as: 

P{S,)=MlM'Xa^aVa' [73] 

Where P(SJ is the marginal probability for each of the different possible states S^. of the random variable at node a, 
and M^g is the converged value of the message from source node e to destination node a, and similarly for the messages 
from the other nodes neighboring node a. Each of those messages is a vector of the dimensionality of the number of 
different possible states at node a. To avoid cluUering these fomiulae, we have not suppressed the functional depend- 
ence of the messages; to make it explicit, one would write for example M^iSJ. Figure 8(a) is a graphical representation 
of equation [73]. Each arrow symbolizes a single-indexed message. 
[0158] The marginal probability of a region consisting of two linked nodes is expressed as: 

PrS,,S,)=A<A^/l<M>^/w2</W^Jcp,^Va¥6. [74] 

where the terms are defined analogously as with equation [73], and, for example, yw|^ means the double-indexed 
message from source nodes ch to destination nodes ab. Here, the value <Pa^ is the compatibility matrix between nodes 
a and b, and is a function of and S^,. Figure 8(b) is a graphical representation of Equation [74]. An arrow dragging 
a line symbolizes a double-indexed message, and a simple line connected two nodes symbolizes a compatibility matrix. 
[0159] The marginal probability of a square region is expressed as: 

P{S,,S,,S,S^} -f^K<^M^e^e<W^^ . [75] 

With the notation defined as for the equations above. Figure 8(c) is a graphical representation of Equation [75]. 
[0160] By marginalizing equation [741 over S^^and selling Ihe result equal lo equation [73], we obtain one of the two 
sets of self-consistent message equations that are satisfied by the converged values of the messages: 

^a-^b^A^IAat^t' . [76] 

[0161] Figure 9 is a graphical representation of equation [76], 

[0162] By marginalizing equation [75] over and S,, and setting the result equal to equation [74], we obtain the 
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other set of self-consistent message equations: 



^e,= " " • 177] 



[0163] For the values of /W^ and in equation [77], we use the values determined from equation [76]. Figure 10 is 
a graphical representation of equation [77]. 

Triangular lattice 

[0164] A triangular lattice is another Markov network of particular interest for modeling two dimensional datasets. 
Figure 2(b) is an example of a triangular lattice. Analogously with the equations and notation as described above, we 
described the following. 

[0165] The marginal probability of a single-node region is expressed as: 

P{SJ = t^MXKl^Ma'V,- [78] 
[0166] Figure 11(a) is a graphical representation of equation [78]. 

[0167] The marginal probability of a region consisting of two linked nodes is expressed as: 

P(S,,S,) = /l^A^A/^AyyX<^^A/X>ae<Pa.¥aM/.- [79] 

[0168] Figure 11 (b) is a graphical representation of equation [79). 
[0169] The marginal probability of a triangle region is expressed as: 

P(S„ S„ S,) = f^Al<<<<I^Xl^l^Ml^i^^Ke<^ae'PafP^^ [80] 
[0170] Figure 11 (c) is a graphical representation of equation [80], 

[0171] By marginalizing equation [79] over S^. and setting the result equal to [78], we obtain the self-consistent 
message equation: 

^a=^ ^e^e^e^e^e^ae^ae'^ae'^e [81] 

[0172] Figure 12 is a graphical representation of equation [81]. 

[0173] By marginalizing equation [80] over S„ and setting the result equal to equation [79], we obtain the self -con- 
sistent message equation: 

[0174] Figure 13 is a graphical representation of equation [82]. As described above for the square lattice, for the 
values of ^/Wand /W^ in equation [82], we use the values determined from equation [81] during the same iteration. 

Modified Loopy Super-node IWethod 

[0175] We now describe an alternative embodiment of our invention. We call this embodiment the "modified loopy 
super-node method." This method is partially motivated by Pearl's "clustering method", and the related "junction tree 
method" which are prior-art methods that has been employed to obtain exact answers for Markov networks with only 
a few loops, see "Learning in Graphical Models", edited by MJ. Jordan, 1998. 

[0176] In the junction tree method, one generates a Markov network equivalent to the original one by grouping the 
original nodes into "super-nodes." New (p compatibility matrices and y evidence functions are also generated for the 
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super-nodes such that the probability of any state using the super-node formulation is equivalent to its probability in 
the original Markov network. If the super-node Markov network has no loops, one can standard belief-propagation to 
delonmino exactly the desired marginal probabilities, even if the original Markov network had some loops. Using the 
standard bchof propagation method on a super-node network that was designed to have no loops is called the "junction 
Iroc method " 

[0177] 01 course, if the super-node network still has loops, one is still faced with a difficult problem which standard 
bcliol propnqation will not solve exactly. Nevertheless, one might consider the obvious "naive" method of using ordinary 
loopy bchcl propr.gaiion on the super-node network. This "naive" method, which one might call the "loopy super-node 
method " somcnncs gives acceptable results, but empirical testing shows that it also can give very poor results for 
some simple rctv;or<s 

[01 78] In our riiicrnativc embodiment of our invention, we make a modification to the loopy super-node method which 
guaraniccs Ih^I when it converges. Our modified method also gives results that are equivalent to using a Kikuchi 
approximntion with clusters corresponding to the chosen super-nodes. Thus, our "modified loopy super-node method" 
is an altcrnHirwc ncthod thai obtains the same results as our "canonical" method described above. 
[01 79] f- or concf cicncss wc illustrate the modified loopy super-node method using the same specific Markov network 
model and cMo<cc of dusters (hat wc used to illustrate the "canonical" method, see Figure 14. This network consists 
of nine nodes *h n'Ouqh /j. and wo choose four arbitrary clusters (abde), (beef), (degh), and (efhi). These four clusters 
arc also rcfcrrea lo as 'supcf noocs " 

Constructing a tuper-rxxie network 

[0180] Wc bcq^n t?y /ctofmui^triq the average energy U. The Kikuchi approximation for the average energy Is in fact 
exactly corrcc' w-ic wo u5c the corrcr! marginal probabilities. For our example network, the average energy is ex- 
pressed within the Kih Lichi .iporoim.ilion as: 



•fc. «A aj, a. 



[0181] This cxp'cssion includes terms that sum over the intersection regions (be), (de), (fe), (he), and (e).As for the 
canonical method nDovc we want to eliminate these terms and have an expression that only contains sums over the 
super-node rcqiois Wc can easily do that because each of the sums over the intersection regions can be expressed 
as a sum over a supc' node fcgion For example, 

X^(aw)^(o^*.)=Z^(«^.)^(«6,)- [84] 



[0182] This oquaion is Cue because of the marginatization formulae for P(a^e)' 

X/*(a^)£:(aw) = ZX/'(«^,)£KJ ' [85] 

= [86] 

= J,Eia^)Pia,J. [87] 



[0183] Therefore, the Kikuchi average energy can be expressed as: 
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"•^t* "6"/ O^tt, 

5 [01 84] The £ terms are not uniquely determined. They depend on how we choose to distribute the intersection region 
terms. One choice is: 
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Eic(^) = E(a^^)-Eia^) + E(a,) [89] 

E{a^^) = Eia^,,)-Eia,f) [90] 

E(ci,^)='E(.cx^J-E(a^) [91] 

Eia^) = E{cc^ ) - Eia^ ) . [92] 



[0185] Our goni is to construct a Markov network between the super-nodes that is equivalent to the original Markov 
network Lcl us liibci the four super-nodes by the numbers 1, 2. 3. and 4, see Figure 1 5a. Wo construct the evidence 
^5 luicl on& using (ho equation, for super-node 1. 

^,(ar,) = e-^"'•"^ [93] 



ard &im Uf 7 for He Other super-nodes. 

(0186] The compHiibilily matrices between the super-node regions are chosen so that if two super-nodes are in 
ccnsisicnt stHios then iho compatibility function is one, otherwise jhe compatibility function is zero. For example, if 
SLpc node ; s in H hrsl "super-node" state such that node b is in some first "node" state, and super-node 2 is in a 
second 'suoor ncdc* slate such that node bis in a second "node" state, than the two super-nodes are in inconsistent 
'sjpcf rooc' stHics The compatibility matrix ?»,i(ai .03) should equal zero for those particular states. 
[01871 " wo dcdne the evidence vectors and compatibility matrices as described, then the joint probability for super- 
node StrtlOS ts 



na„a^,ay,a,) = ^Y[f^{aS[[9^{a„,a,) [94] 

^ » mn 

(OHOd m arc supor nocc indices) will exactly agree with the joint probability P defined for the original nodes. 
[0186] Wo sjmman/c our construction in Figure 15a. We define a new "super-node" graph 1500 that has four super- 
nouc^ 1501-1504 1 - {Hb<Sc).Z^ {bce/),3^ (c/eg/i), and 4 = (e//7/). We denote the local "evidence potentials" for the 
toji super-nodes byr Wc set the paiwise connections (links) p1 51 1-1 51 4 between the super-nodes so that super- 
nodes must agree on shared nodes. By construction, the super-node network 1 5a has the same probability distribution 
as the original network of Figure 14. 

[0189J tn order that our super-node network is consistent with the Kikuchi approximation to the Gibbs free energy, 
we also must enlorcc all the constraints between the regions involved in the Kikuchi approximation. Wc aim for a 
minimal set of constraints (Lagrange multipliers). Note that in the canonical method, we used an over-complete set of 
constraints. The following set of marginalization constraints are sufficient for our example network: 
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Y,P{c(^)'=PM'='ZPicc^^) [95] 



J,Pic^^) = Pio^^) = J!,P{c^.fli) [96] 
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lLP(a^<) = Pia.J=I,Pia^,,} ' [97] 



X^(«*.*) = 'P(«*) = Z^(«^-w.) [98] 



Yp(o!u) = P(oc.), [99] 



in addition to the standard normalization constraints on the super-node regions. 

[0190] With these choices, the Kikuchi free energy for this example network takes the following fomn: 



50 



55 
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+ r 



X ) In P(a, ) +X P(«2 ) In P(a, )+ 5; P(o:3 ) In /'(af3 ) + X ^(^4 ) In P{a, ) 



+ rX^(aJln^'K) 



1- 



X^(«.)] 



+ 73 



+ 73 



+ XV(^»,)U^i,)-X^C^i) 



+ X^^».(«fc.) 



+ X'^wc(aA) 



i-X/'cajj 



+Ev^^*.) 
+X'U(^j 



+ X^4V.(ayJ 



/'(a.)-X^(«**) 



[100] 



[0191] When we differentiate the Kikuchi free energy with respect to the super-node region probabilities, we obtain: 
PW = ^9,icx,)A,^(a^)A,^(a„) [101] 



[102] 



[103] 



[104] 



Note that these equations are essentially identical to those that would be obtained from the naive loopy belief propa- 
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gation method on the super-node graph. For example, using the naive loopy belief propagation method, the marginal 
probability at super-node 1 would be; 



P(a,) = kip,(a,)M^(a,)M^ia,) , [105] 

where (a,) is the message from source super-node 2 to destination super-node /. At first sight, there is a difference 
with equation [101] because seems to depend on the state of all four nodes (abde). In fact, when we take the 

compatibility matrices into account, we finds that in the naive loopy belief propagation method, the messages between 
super-nodes only depend on the states of the nodes that they share. Thus we can make the identifications A/^ia^, ) - 

and /^(a/,e)=>.4\/Te(«/ie)- 

[0192] In Figure 15b,we draw graphically these messages passed in the naiVe loopy belief propagation method on 
the super-node graph. 

[0193] When we substitute the region probabilities from equations [101-104] into the constraint equations [95-98] 
we obtain: 



^lLy^M^(^^^^Uri^Jc)=^2XV^^ [106] 



^ X l^i («t)^ (<x,, ) = ^ £ t2^3 (0^)^ (^^)^. K J [1 07] 



^2lLy'2(oe^)^(a^)JlJ,^(a^J=k,J^WA^^^^^ [108] 



^11^3(^^^c(^J^(c^J==k,^¥^M^^^ [109] 



[0194] Those cquHttons will be satisfied when we make the identification between the Lagrange multipliers X and 
the loopy bci.cf p'oprtqHiion messages indicated previously, and require that the messages obey the standard propa- 
gnhoi cquiHiions 

[0195] Tnkc for cx^mDlc. equation [106]. We can simplify this as follows: 

^MiMl,^M)M^ia^) = k,Mi(ajY,^^^^ [HO] 
[0196] But if wc sr.tisfy the standard belief propagation equations on the super-node graph 



[111] 



[112] 
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then equation [110] will be satisfied. Thus equations [106-109] are equivalent to the naive loopy belief propagation 
equations for the super-node network. 

[0197] However, there exist additional equations that need to be considered in the Kikuchi formalism that give rise 
to a modification of the naive loopy belief propagation method. In particular, when we differentiate the Kikuchi approx- 
imation to the Gibbs free energy with respect to P{a^j}, P{a,J), P(a^J, Pta^^). and P(ag. we obtain the equations: 

^("fe)=^fe>^^fe(«te)>^4\fe(«te) [114] 
P(^he)=f<heh\hei^he)>^4\hei^he) [115] 

P((^e>f<e^be\ei<^e)- [117] 

[0198] The first three of these equations are satisfied automatically when we identify the X's with belief propagation 
messages as before. For example, for equation [113], we have: 

M. IW, IW. [118] 

[0199] However, equations [1 1 6-1 1 7] are not necessarily satisfied when we tal<e the X's to have the same values as 
the belief propagation messages. There is another condition, which can be obtained from the constraint equation 

Using this equation, we find: 

E^Vfrc(a^6c)^V^(^6.)-*. ' [119] 

where k is some constant. As usual, the actual value of the constant is not important; what is important in this case is 
the fact that this sum should not depend on the state of node e. We can guarantee that all the derivative equations 
[106-109] and [113-119] are satisfied if we identify the Vs with loopy belief propagation messages, but make sure that 
they satisfy the additional constraint of equation [119]. 

[0200] We now define a "normalization" operator. In Figures 15c and 15d, the nomnalization operator is indicated by 
ellipses 1530, 1541 -1544. The nomnalization operator takes as input the messages A^(ai,J and Milage) and returns 
as output normalized messages Norm{l\/P^{at,e)) and Norm{Ml(aij^)) defined by: 

NormiMliaJ) = ^^^"'^^ f 120] 
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NormiMi{a^)) = -=^kL =. [121] 



[0201] By construction: 

J,Norm(M^(aJ)Norm(M[(aJ) =1, [122] 



so if we identify the Xs with normalized beWel propagation messages, equation [119] will be satisfied. The normalization 
constraint [119] can be satisfied simultaneously with the ordinary belief propagation constraints implicit in equations 
[1 06-1 09], if we follow the modified loopy junction tree method (see figure 1 5c). For this example network, that means 
that we iteratively solve the belief-propagation equations for the super-node lanice, taking care at every Iteration to 
apply the normalization operator to messages A/Zgt^^fje) '^(c^te)- 

[0202] Applying this method to our example network yields results identical to those obtained from the canonical 
method, as expected. The results are significantly more accurate than those obtained from the "naive" loopy super- 
node method. 

The modified loopy super-node method for a general Markov network 

25 [0203] We have described the modified loopy super-node method for our example network. We now describe how 
to use this method for an arbitrary network with arbitrary groupings of nodes into super-nodes. A general version of 
the method Includes the following steps as shown in Figure 16. 

[0204] In step 1601, group all of the nodes of the network 1600 into a set of intersecting clusters 1650. As stated 
above, the grouping can be heterogeneous and arbitrary. As an advantage of the arbitrary clustering according to the 
30 invention, the user can trade accuracy for computational complexity. Larger clusters give better results, but take more 
time to resolve. Groups of nodes that are less relevant can be grouped into smaller clusters. 
[0205] In step 1602, determine a minimal number of marginallzation constraints 1651 that need to be satisfied be- 
tween the clusters 1650. 

[0206] In step 1 603, construct a super-node network 1 652. In the network 1 652, each cluster of nodes is represented 
35 by just one super-node 1 653. Super-nodes that share one of the marginalization constraints determined In step 1 602 
are connected by a super-link 1654. 

[0207] In step 1604, the super-node network 1 652 is searched to locate closed loops of super-nodes that contain at 
least one common node. For each such closed loop, determine a normalization operator 1655. 
[0208] In step 1605, update the messages between super-nodes using standard belief propagation. 
40 [0209] In step 1606, replace the messages by their nomnallzed values using the corresponding normalization oper- 
ator, and repeat step 1 605 until convergence in step 1607. 

[0210] If we have a fixed point of the modified loopy super-node method, then It Is equivalent to the result obtained 
from mininnizing the KIkuchi free energy 

•^5 Application to Decoding Error-Correcting Codes 

[0211] An "error-correcting" code is an encoding of digital, e.g., binary, messages. The code adds redundancy that 
is designed to protect the transmission of the messages from corruption by noise. Typically, the sender of a message 
transmits "blocks" of bits (binary digits), where each block contains the "data bits" of the original message, plus some 
additional "check bits" which help to decode the message if it Is arrives at a receiver in a corrupted form. For example, 
a check bit may encode the parity of the sum of selected data bits. 
[0212] Markov networks have been widely used to represent error-correcting codes, sec "Good Error-Correcting 
Codes based on Very Sparse Matrices", by D, MacKay. IEEE Transactions on Information Theory, IVIarch, 1999. In the 
Markov network formulation of an error-correcting code, some of the nodes of the Markov network will correspond to 
55 the data bits and the check bits, and the links between the nodes will enforce the appropriate relationships between 
data bits and check bits. 

[021 3] The standard decoding algorithm used for some errpr-correcting codes is the loopy belief propagation method, 
using either marginal or MAP probabilities, run on the Markov network that corresponds to the error-correcting code, 
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see R. WcEliece, D. MacKay, and J. Cheng, IEEE J. Selected Areas in Communication, 1997. When the MAP version 
of belief propagation is used, the decoded states of the bits are just the MAP states of the corresponding nodes. When 
the marginal probability version of belief propagation is used, the decoded states of the bits are taken to be those 
states of the corresponding nodes that have the highest marginal probability. 

[0214] In decoding according to our invention, the loopy belief propagation method is replaced by the method ac- 
cording to the invention. Our method provides improved decoding for error-correcting codes. 

Other Applications 

[0215] The belief propagation method as described herein can be used in a number of applications. It can be used 
In computer vision problems where scenes need to be estimated, see U.S. Patent Application Sn. 09/203,108, "Esti- 
mating Scenes Using Statistical Properties of Images and Scenes," filed by Freeman et al. on November 30, 1998, 
also see U.S. Patent Application Sn. 09/236,839, "Estimating Targets Using Statistical Properties of Observations of 
Known Targets." filed by Freeman et al. on January 25. 1999. It can also be used In object recognition systems such 
as the air-to-ground targeting system described in U.S. Patent No. 5,963,653, "Hierarchical infonmatton fusion object 
recognition system and method/' issued to McNary et al. on October 5, 1 999. A number of other applications that are 
well suited for the invention are mentioned in U.S. Patent 5,802,256, "Generating Improved belief networks," issued 
to Heckerman. et al. on September 1, 1998, for example, crop production estimation, program debugging, coastal 
ocean environments prediction, diagnosing linear lightwave networks. Speech recognition applications that can use 
the present invention are described in U.S. Patent 5,623,609, "Computer system and computer-implemented process 
for phonology-based automatic speech recognition," issued to Kaye et al. on April 22, 1997. 

[0216] In this description of the invention, we used specific terms and examples. It is to be understood that various 
other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object 
of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the 
invention. 



Claims 

1 . A method for dcienmining probabilities of states of a system represented by a model including a plurality of nodes 
connected by links, each node representing possible states of a corresponding part of the system, and each link 
representing statistical dependencies between possible states of related nodes, comprising: 

grouping the plurality of nodes into arbitrary-sized clusters such that every node is included in at least one 
cluster and each link is completely contained in at least one cluster; 

defining messages based on the arbitrary-sized clusters, each message having associated sets of source 
nodes and destination nodes and a value and a rule depending on other messages and selected links con- 
necting the source nodes and destination nodes; 
assigning Initial values to the messages; 

updating the value of each message using the associated rule; and 

determining approximate probabilities of the states of the system from the messages when a termination con- 
dition is reached. 

2. The method of claim 1 further comprising: 

identifying nodes in intersections of clusters, and intersections of intersections of clusters as regions of nodes; 
defining the messages based on the regions of nodes. 

3. The method of claim 1 wherein the network has pair-wise statistical dependencies between nodes, and the overall 
probability of a particular assignment of states s at the nodes is: 

^ ij 

where the first product runs over all linked neighboring nodes, / and /, and wherein a ^ compatibility matrix repre- 
sents the statistical dependencies between the possible states s of the related nodes, and the y function for each 
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node represents evidence that a particular node is in a particular state, arid Zis a normalization constant to insure 
that the sunn of the probabilities of all possible states of the system is equal to one. 

4. The method of claim 1 wherein the Initial values of the messages are random positive numbers. 

5. The method of claim 1 wherein the Initial values of the messages are all ones. 

6. The method of claim 1 wherein the termination condition is a convergence the probabilities of the states of the 
system to a predetermined precision. 

7. The method of claim 1 wherein the approximate probabilities are marginal probabilities. 

8. The method of claim 1 wherein the approximate probabilities are maximum a posteriori probabilities. 

9. The method of claim 1 wherein the states are discrete. 

10. The method of claim 1 wherein the states are continuous. 



11. The method of claim 1 wherein the network model Includes closed loops. 

12. The method of claim 1 wherein the nodes are arranged in a square lattice. 

13. The method of claim 1 wherein the nodes are arranged in a triangular lattice. 

14. The method of claim 1 wherein the nodes and links are a Markov network representation of an error-correcting code. 

1 5. A method for determining probabilities of states of a system represented by a model Including a plurality of nodes 
connected by links, each node representing possible states of a corresponding part of the system, and each link 
representing statistical dependencies between possible states of neighboring nodes, comprising: 

grouping the plurality of nodes Into arbitrary-sized clusters such that every node is included In at least one 
cluster, and each link is completed contained In at least one cluster; 

identifying nodes in intersecting clusters, and intersections of Intersecting clusters as regions, and intersections 

of regions as sub-regions; 

discarding duplicate regions and sub-regions; 

arranging the regions and sub-regions in a top-to-bottom hierarchy of intersections; 

defining messages between regions and direct sub-regions directly connected In the hierarchy, each message 
having associated sets of source nodes and destination nodes and a value and a rule depending on other 
messages and selected links connecting the source nodes and destination nodes, the destination nodes being 
those nodes in the sub-region, and the source nodes being those nodes in the region and outside the sub- 
region; 

assigning initial values to the messages; 

updating the value of each message using the associated rule until a temninatlon condition is reached; 
determining approximate probabilities of the slates of the system from the messages when a tenmination con- 
dition is reached. 
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